Make a data folder
Drag favorability.csv into the data folder
Make the existing folder an RStudio project
Open an R Markdown Notebook
library(tidyverse) plus other libraries
IMPORT data
EDA: Visualize ggplot(data = starwars, aes(hair_color)) + geom_bar()
EDA: skimr::skim(starwars)
EDA: summary(fav_rating)
left_join(starwars, fivethirtyeight)
Transform data: five dplyr verbs …
count / group_by & summarize
Interactive visualization ggplotly
Quick Linear Regression
Reports: notebooks, slides, dashboards, word document, PDF, book, etc.
library(tidyverse) plus other libraries
library(tidyverse)
library(skimr)
library(plotly)
library(moderndive)
library(broom)
read_csv("file_name.csv")
See also: data import wizard
## fav_data <- read_csv("data/fav.csv")
favorability <- read_csv("https://raw.githubusercontent.com/libjohn/intro2r-code/master/data/538_favorability_popularity.csv", skip = 11)
dplyr::starwars
data("starwars")
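The skip argument used above tells read_csv() to ignore a number of lines at the top of the file before it looks for the header row. The sketch below illustrates this on a small, made-up text (the notes lines and ratings are hypothetical, not the contents of the 538 file), using I() to pass literal data, which assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical file contents: two lines of notes precede the real header row.
raw_text <- paste(
  "Notes: example only",
  "These two lines are not data",
  "name,fav_rating",
  "Luke Skywalker,100",
  "Yoda,200",
  sep = "\n"
)

# skip = 2 drops the two notes lines, so "name,fav_rating" becomes the header.
demo <- read_csv(I(raw_text), skip = 2, show_col_types = FALSE)
demo
```

If the column names come out wrong after an import, the skip value is usually the first thing to check.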
Visualize with the ggplot2 library.
plot <- ggplot(data = starwars,
               aes(x = hair_color)) +
  geom_bar()
plot
Arrange bars by frequency using forcats::fct_infreq()
plot1 <- ggplot(starwars,
                aes(fct_infreq(hair_color))) +
  geom_bar()
plot1
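To see what fct_infreq() changes, it can help to look at the factor levels directly rather than at a plot. This is a minimal sketch on a made-up character vector (the values are hypothetical); the levels are reordered from most to least frequent, which is the order ggplot2 then uses for the bars.

```r
library(forcats)

# Hypothetical values: "b" appears 3 times, "a" twice, "c" once.
x <- c("b", "a", "b", "c", "b", "a")

# Levels are ordered by descending frequency.
ordered_levels <- levels(fct_infreq(x))
ordered_levels
# "b" "a" "c"
```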
skimr::skim(starwars)
The skimr library presents summary EDA results using the skim() function.
skim(starwars)
-- Data Summary ------------------------
Values
Name starwars
Number of rows 87
Number of columns 14
_______________________
Column type frequency:
character 8
list 3
numeric 3
________________________
Group variables None
-- Variable type: character --------------------------------------------------------------------------------
# A tibble: 8 x 8
skim_variable n_missing complete_rate min max empty n_unique whitespace
* <chr> <int> <dbl> <int> <int> <int> <int> <int>
1 name 0 1 3 21 0 87 0
2 hair_color 5 0.943 4 13 0 12 0
3 skin_color 0 1 3 19 0 31 0
4 eye_color 0 1 3 13 0 15 0
5 sex 4 0.954 4 14 0 4 0
6 gender 4 0.954 8 9 0 2 0
7 homeworld 10 0.885 4 14 0 48 0
8 species 4 0.954 3 14 0 37 0
-- Variable type: list -------------------------------------------------------------------------------------
# A tibble: 3 x 6
skim_variable n_missing complete_rate n_unique min_length max_length
* <chr> <int> <dbl> <int> <int> <int>
1 films 0 1 24 1 7
2 vehicles 0 1 11 0 2
3 starships 0 1 17 0 5
-- Variable type: numeric ----------------------------------------------------------------------------------
# A tibble: 3 x 11
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 height 6 0.931 174. 34.8 66 167 180 191 264 ▁▁▇▅▁
2 mass 28 0.678 97.3 169. 15 55.6 79 84.5 1358 ▇▁▁▁▁
3 birth_year 44 0.494 87.6 155. 8 35 52 72 896 ▇▁▁▁▁
summary(favorability)
name fav_rating
Length:14 Min. :110.0
Class :character 1st Qu.:148.5
Mode :character Median :392.0
Mean :369.0
3rd Qu.:559.5
Max. :610.0
left_join(starwars, fivethirtyeight)
Joins, or merges, are part of the dplyr library.
starwars %>%
  left_join(favorability, by = "name") %>%
  select(name, fav_rating, everything()) %>%
  arrange(-fav_rating)
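A left join keeps every row of the left-hand table and fills in NA where the right-hand table has no match. The sketch below shows that behavior on two tiny made-up tables (the names and ratings are hypothetical), and uses anti_join() as one way to list the rows that found no match.

```r
library(dplyr)

# Hypothetical stand-ins for the starwars and favorability tables.
characters <- tibble(name = c("Luke Skywalker", "Yoda", "Greedo"))
ratings <- tibble(name = c("Luke Skywalker", "Yoda"),
                  fav_rating = c(10, 20))

# All three characters are kept; Greedo gets NA for fav_rating.
joined <- left_join(characters, ratings, by = "name")
joined

# Rows of the left table with no match in the right table.
unmatched <- anti_join(characters, ratings, by = "name")
unmatched
```

Checking the unmatched rows is a quick way to spot names that differ only by spelling or spacing between the two sources.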
From the dplyr library, use the five verbs …
select to subset data by columns
starwars %>%
  select(name, gender, hair_color)
filter to subset data by rows
starwars %>%
  filter(gender == "feminine")
arrange to sort data
starwars %>%
  arrange(desc(height), desc(name))
mutate to add a new variable or transform an existing one
starwars %>%
  drop_na(mass) %>%
  select(name, mass) %>%
  mutate(big_mass = mass * 2)
count / group_by & summarize
Subtotals of variables
starwars %>%
  count(gender)
Variable totals (summarise can also perform other calculations, not shown here)
starwars %>%
  drop_na(mass) %>%
  summarise(sum(mass))
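As a sketch of the other calculations summarise() can do, the chunk below computes several named summaries in one call. The column names total_mass, mean_mass, and n_characters are illustrative choices, not part of the original workshop code.

```r
library(dplyr)
library(tidyr)

# Several summaries at once; naming each result makes the output easier to read.
mass_summary <- starwars %>%
  drop_na(mass) %>%
  summarise(
    total_mass   = sum(mass),
    mean_mass    = mean(mass),
    n_characters = n()
  )
mass_summary
```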
Variable subtotals and calculations: group_by(gender, species) %>% summarise(mean_height = mean(height), total = n())
starwars %>%
  drop_na(height) %>%
  group_by(gender, species) %>%
  summarise(mean_height = mean(height), total = n()) %>%
  arrange(desc(total)) %>%
  drop_na(species) %>%
  filter(total > 1) %>%
  select(species, gender, total, everything())
Interactive visualization with ggplotly() from the plotly library
ggplotly(plot1)
Predict mass from height after removing Jabba from the data set. We'll rely mainly on base R's lm(), with moderndive for model output, the tidyverse pipe %>%, and dplyr for data transformations. The broom library is an alternative for working with model results.
model <- lm(mass ~ height, data = starwars %>% filter(mass < 500))
model
Call:
lm(formula = mass ~ height, data = starwars %>% filter(mass <
500))
Coefficients:
(Intercept) height
-32.5408 0.6214
summary(model)
Call:
lm(formula = mass ~ height, data = starwars %>% filter(mass <
500))
Residuals:
Min 1Q Median 3Q Max
-39.382 -8.212 0.211 3.846 57.327
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -32.54076 12.56053 -2.591 0.0122 *
height 0.62136 0.07073 8.785 4.02e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.14 on 56 degrees of freedom
Multiple R-squared: 0.5795, Adjusted R-squared: 0.572
F-statistic: 77.18 on 1 and 56 DF, p-value: 4.018e-12
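To turn the fitted coefficients into an actual prediction, base R's predict() can be applied to new data. This is a minimal sketch: the model is refit so the chunk stands alone, and the 180 cm character is a hypothetical input. With the coefficients printed above, the prediction is roughly -32.54 + 0.621 × 180 ≈ 79.3.

```r
library(dplyr)

# Same model as above, refit here so the chunk is self-contained.
model <- lm(mass ~ height, data = starwars %>% filter(mass < 500))

# Hypothetical new character, 180 cm tall.
new_character <- tibble(height = 180)

# predict() applies intercept + slope * height to each row of newdata.
predicted_mass <- predict(model, newdata = new_character)
predicted_mass
```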
A nice explanation of basic regression can be found in chapter 5 of the book Statistical Inference via Data Science. You can also use the moderndive package to access helpful functions such as get_correlation(), get_regression_table(), etc.
You may also appreciate or prefer the broom package for the very nice tidy(), glance(), and augment() functions.
starwars %>%
  filter(mass < 500) %>%
  get_correlation(mass ~ height)
# tidy(model)
get_regression_table(model)
# broom::glance(model)
get_regression_summaries(model)
# broom::augment(model)
get_regression_points(model)
Plot mass against height with a fitted linear regression line and confidence interval using geom_smooth().
starwars %>%
  filter(mass < 500) %>%
  ggplot(aes(height, mass)) +
  geom_jitter() +
  geom_smooth(method = "lm")
By changing the output argument in the YAML header, you can render many report styles. A few popular examples include:
| type | YAML syntax | More information |
|---|---|---|
| notebook (alpha or dev) | output: html_notebook | Notebook |
| notebook (final or prod) | output: html_document | HTML document |
| Word document | output: word_document | MS Word |
| slide deck | See Get Started | Xaringan |
| dashboards | flexdashboard or shiny | |
| e-book / web-book | Bookdown | |
| website | Blogdown | |
| website (simple) | Create a website / Distill | |
| PDF | output: pdf_document | PDF document |
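The output format can also be chosen from R code instead of the YAML header, using rmarkdown::render(). The sketch below writes a tiny, hypothetical .Rmd file to a temporary location and renders it; it assumes pandoc is available, as it is in a standard RStudio installation.

```r
library(rmarkdown)

# Hypothetical minimal document, written to a temporary file.
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Demo report\"",
  "output: html_document",
  "---",
  "",
  "Hello from R Markdown."
), rmd_file)

# output_format overrides the format named in the YAML header.
out <- render(rmd_file, output_format = "html_document", quiet = TRUE)
out
```

Swapping "html_document" for "word_document" in the render() call would produce a Word file from the same source.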